Associating information related to components in structured documents stored in their native format in a database

ABSTRACT

A method and system for associating information related to a component of a structured document that is stored in its native format in a database system includes generating a hierarchical node tree comprising a plurality of nodes, where each node represents a component in the structured document, and generating a path associated with each node, where the path follows the hierarchical structure of the structured document from a root node to the node. In each node, an identifier associated with the path is stored. A table is provided that correlates the identifier with information related to the associated path. The information applies to the component represented by the node.

CROSS REFERENCE

This application is a continuation-in-part of U.S. patent applicationSer. No. 10/648,760, filed Aug. 25, 2003, the disclosure of which isincorporated herein by reference in its entirety.

BACKGROUND

Embodiments of the present invention relate to storing and accessinginformation in a structured document in a computer implemented databasesystem.

Structured documents have nested structures, i.e., structures thatdefine hierarchical relationships between components of a document.Documents written in Extensible Markup Language (XML) are structureddocuments. XML is quickly becoming the standard format language forexchanging information over the Internet because it allows the user todesign a customized markup language for many classes of structureddocuments. For example, a business can easily model a complex structureof a document, such as a purchase order, in a form written in XML andsend the form for further processing to its business partners. XMLsupports user-defined tags for better description of nested documentstructures and associated semantics, and encourages the separation ofdocument content from browser presentation.

As more and more business applications use structured documents writtenin XML to present and exchange data over the Internet, the challenge isto store, search, and retrieve these documents using existing relationaldatabase systems. A relational database management system (RDBMS) is adatabase management system which uses relational techniques for storingand retrieving data. Relational databases are organized into tables,which consist of rows and columns of data. A database will typicallyhave many tables and each table will typically have multiple rows andcolumns. The tables are typically stored on direct access storagedevices (DASD), such as magnetic or optical disk drives, forsemi-permanent storage.

Some relational database systems store an XML document as a Binary LargeObject (BLOB). While storing the document as a BLOB is straightforward,accessing the data in the document presents challenges because a BLOB isnot easily queried. For example, each BLOB must be read and parsedbefore it can be queried. For a large number of BLOBs, this process canbe prohibitively costly and time consuming.

Other relational database systems store an XML document by mapping theXML data to rows and columns in one or more relational tables. Thisapproach, however, introduces inefficiencies, especially for large XMLdocuments. For example, mapping an XML document to a relational databasecan result in a large table having numerous columns with null values(which consumes valuable memory) or a large number of tables (which isinefficient and consumes memory).

Moreover, neither approach provides a way to store information relatedto the document data, e.g., metadata, that does not introducesignificant inefficiencies and require significant resources.Furthermore, neither approach preserves the nested structure of thedocument. Thus, parent-child(ren) relationships are difficult toreconstruct.

Accordingly, it is desirable to be able to store structured documents intheir native formats within a database system. It is also desirable toassociate related information with components of structured documents.It is also desirable to integrate structured documents into an existingdatabase system in order to use the existing resources of a databasesystem.

SUMMARY

A method and system for associating information related to a componentof a structured document that is stored in its native format in adatabase system includes generating a hierarchical node tree comprisinga plurality of nodes, where each node represents a component in thestructured document, and generating a path associated with each node,where the path follows the hierarchical structure of the structureddocument from a root node to the node. In each node, an identifierassociated with the path is stored. A table is provided that correlatesthe identifier with information related to the associated path. Theinformation applies to the component represented by the node.

DRAWINGS

These features, aspects and advantages of the present invention willbecome better understood with regard to the following description,appended claims, and accompanying drawings, which illustrate examples ofthe invention. However, it is to be understood that each of the featurescan be used in the invention in general, not merely in the context ofthe particular drawings, and the invention includes any combination ofthese features, where:

FIG. 1 is a block diagram of exemplary computer hardware that can beused to implement the present invention;

FIG. 2 is a block diagram of an XML Storage mechanism according to aversion of the present invention;

FIG. 3 is a high level flowchart illustrating a process for storing XMLdocuments in their native format in a database according to a version ofthe present invention;

FIG. 4 is a block diagram of an XML Record according to a version of thepresent invention;

FIG. 5 is a block diagram of a node according to a version of thepresent invention; and

FIG. 6 illustrates an exemplary path table according to a version of thepresent invention.

DESCRIPTION

Embodiments of the present invention relate to storing and accessinginformation in a structured document in a computer implemented databasesystem. Various modifications to the preferred embodiment and thegeneric principles and features described herein will be readilyapparent to those skilled in the art. For example, the followingdiscussion is presented in the context of a DB2® database environmentavailable from IBM® Corporation. It should be understood, however, thatthe present invention is not limited to DB2 and may be implemented withother relational database systems or with other native XML databasesystems. Thus, the present invention is to be accorded the widest scopeconsistent with the principles and features described herein.

An exemplary computer hardware environment that can be used with thepresent invention is illustrated in FIG. 1. A distributed computersystem 100 utilizes a network 103 to connect client computers 102executing client applications to a server computer 104 executingsoftware and other computer programs, and to connect the server computer104 to data sources 106. These components can be coupled to one anotherby various networks, including LANs, WANs, and the Internet. Each clientcomputer 102 and the server computer 104 additionally comprises anoperating system and one or more computer programs (not shown).

The server computer 104 uses a data store interface 107 for connectingto the data sources 106. The data store interface can be connected to adatabase management system (DBMS) 105, which supports access to the datastore 106. The DBMS 105 can be a relational database management system(RDBMS), such as the DB2® system developed by IBM Corporation, or italso can be a native XML database system. The interface and DBMS 105 canbe located at the server computer 104 or may be located on one or moreseparate machines. The data sources 106 may be geographicallydistributed.

The DBMS 105 and the instructions derived therefrom all comprise ofinstructions which, when read and executed by the server computer 104cause the server computer 104 to perform the steps necessary toimplement and/or use the present invention. Those skilled in the artwill recognize that the exemplary environment illustrated in FIG. 1 isnot intended to limit the present invention, and that alternativeenvironments may be used without departing from the scope of the presentinvention.

According to one version, the DBMS 105 includes an XML Storage mechanism200 that supports the storage of XML documents in their native format ondisk. Storing data “on disk” refers to storing data persistently, forexample, in the data store 106.

FIG. 2 is a block diagram of the XML Storage mechanism 200 according toa version of the present invention. The XML Storage mechanism 200comprises a parser 204, a node tree generator 206 and a path generator208.

Typically, a structured document can be represented by a plurality ofnodes. The term “node” is used in the Direct Object Model (DOM)-sense,which is a standard XML construct well known to those skilled in theart. In that construct, each node corresponds to a component of the XMLdocument. The component can be an element, an attribute, a value, orother XML construct. The parser 204 interprets the structured documentand identifies its node components. The node tree generator 206 receivesthe nodes representing the XML document and forms a hierarchical nodetree 208.

Each node of the XML document can be described by a path that definesthe hierarchical relationship between the node and its parent node(s).Every path begins at a root node corresponding to a root element andfollows the hierarchical structure defined by the XML document. The pathgenerator 210 generates a path for each node in the node tree 208.

FIG. 3 is a flowchart illustrating a method for storing XML documents intheir native format in a database according to a version of the presentinvention. Referring to FIG. 2 and FIG. 3, the XML Storage mechanism 200receives or retrieves an XML document 202 for storage (step 300). TheXML document 202 is parsed by the parser 204, e.g., a standard SAX(Simple API for XML) parser (step 302).

Next, the node tree generator 206 takes the parsed XML data andgenerates a plurality of nodes that form the hierarchical node tree 208that represents the XML document (step 304). For example, suppose astructured document 202 is as follows: <Questionnaire> <Name> Alice</Name> <Questions> <Q1> Yes </Q1> <Q2> No </Q2> </Questions><Questionnaire>The node tree 208 would comprise the following element nodes:

The node tree 208 preserves the hierarchical structure of the XMLdocument 202 and also preserves the document order, i.e., the order ofthe nodes.

After the node tree 208 is generated, the path generator 210 creates apath for each node in the node tree 208 (step 306). Thus, referring tothe example above, the path for node Q1 can be:

-   -   /Questionnaire/Questions/Q1

The plurality of nodes forming the node tree 208 is stored in an XMLRecord 400 and the paths are stored in a path table 600 (step 308). Thepath table 600 is preferably a standard relational table, e.g., withrows and columns, that is stored in the database 106 (FIG. 1).

The XML Record 400 is similar to a standard database record that storesrelational data except that the XML Record 400 stores XML data. Storingthe plurality of nodes in a record 400 is advantageous for severalreasons. A record 400, like an XML document, is variable in length.Records 400 also can be re-directed, providing a layer of indirectionthat insulates pointers into a tree, from e.g., within the tree itself,from indices, or from an anchor table (described below), if the record400 is moved to a different page. Moreover, the infrastructure for fixedpage buffer management, recovery, utilities (backup/restore), logging,locking, and replication can be reused.

FIG. 4 is a block diagram of an XML Record 400 according to a version ofthe present invention. As is shown, the XML Record 400 comprises a nodeslot array 406 and a plurality of nodes 408 a, 408 b, 408 representingat least one XML document 202. Each entry in a node slot 407 points to anode, e.g., 408, in the node tree 208.

As is shown, each node 408, 408 a, 408 b comprises an array of childpointers 410. Each child pointer 410 generally points to a node slot407, which in turn, points to a node, e.g., 408 b, corresponding to thechild. Thus, for example, Node B (408 b) is the child node of Node A(408 a). Child pointers 410 can be small because they only need to pointto a node slot 407. In certain circumstances a child pointer, e.g., 411,will point to an in-lined character array 412 in the same node, e.g.,408 b. The in-lined character array 412 contains information describingthe child. In other circumstances, the child pointer 410 itself willcomprise information describing the child and its value, i.e., it pointsto itself.

The structure and contents of the node will now be described withreference to FIG. 5, which is a block diagram of a node 500 according toa version of the present invention. The node 500 is generally dividedinto two sections, a node header section 510 and a child pointer section530. The node header section 510 comprises:

-   -   The node type 512;    -   The name of the node 514;    -   The namespace of this node 516;    -   Node slot number of the parent node 518; and    -   A Path Identifier 520.

The node type 512 identifies what component the node 500 represents inthe XML document 202. The node type (512) is typically an element,attribute, processing instruction or text node, although those skilledin the art recognize that other less common node types exist. The nodename (514) is typically the element or attribute name, or processinginstruction target. The namespace of the node (516) is the namespace ofthe element or attribute. The node slot number of the parent node (518)allows the node 500 to identify its parent node for navigation purposes.The path identifier 520, which correlates to a path, identifies the paththat is associated with the node.

The child pointer section 530 comprises at least three formats. Theformat for a child pointer 410 depends on where the child node resides,e.g., in a separate node (408 b) or in the parent node (408 a). Thefirst format (532) applies when the child pointer, e.g., 410 a, pointsto an ordered attribute, or internal child in a different node (e.g.,408 b). Generally, the node type of such a child will be of an elementor attribute type. The second format (534) applies when the childpointer 410 points to an ordered attribute or internal child in anin-lined character array 412 in the node (e.g., 408 b). Here, the childis part of its parent node. The third format (536) is applied when thechild pointer 410 itself fully describes the child and its value.

In this version, each node 500 is relatively small because it includesminimal information about itself, e.g., its component type, its name,its parent, as well as minimal information about its children. This isdesirable because a node tree 208 can comprise hundreds or thousands ofnodes 500 and those nodes 500 preferably can be stored in as few records400 as possible for purposes of performance and efficiency.

In certain circumstances, e.g., during query processing, otherinformation related to the node 500 is desirable or required. Typicallysuch related information is associated with the node's path because morethan one node 500 can share the same path. For example, access controlrules (ACRs) restrict access to the document 202 or portions of thedocument 202 to authorized users. Fine-grained ACRs control access atthe node-level and are typically defined by the paths associated withthe nodes 500. Similarly, schema information and statistical data, e.g.,distribution statistics, related to the node 500 are also generallyassociated with the node's path. Nevertheless, during querying, suchrelated information is difficult to access because the path of a node500 cannot readily be determined.

According to one version of the present invention, a reference to thenode's path is stored in the node 500 itself in the form of the pathidentifier 520. Information related to the node's path is stored in thepath table 600 along with the path identifier 520 associated with thepath. Thus, the path identifier 520 provides a link between the node 500and the path table 600 where information related to the node's path isstored. Consequently, during querying, the DBMS 105 can quickly retrievethe associated information of a node 500.

FIG. 6 is an exemplary path table for an exemplary XML documentaccording to one version of the present invention. The exemplary XMLdocument comprises: <bib ver=“1.0”> <book year=“1994”> <title>TCP/IPIllustrated</title> </book> <book year=“2002”> <title>AdvancedProgramming in the Java environment</title> </book> </bib>

The table 600 comprises a path identifier column 602, a path column 604,and several columns for information related to all nodes with the samepath identifier, e.g., the node 500 associated with the path identifier602 and path 604. For example, an access control column 606 containsaccess control information pertaining to each node 500 via the pathidentifier 602. In one version, the access control column 604 comprisesvalue expressions derived from an access control policy for the document202. Each value expression is a statement that determines whether arequestor is authorized to access the path to the node 500. The valueexpression is preferably compiled before it is stored in the path table600 and therefore the DBMS 105 can execute the value expression atruntime to perform access control checking for a particular node 500.

In addition, the path table 600 can include a schema type column 608that includes a reference to a schema component of an XML Schemacorresponding to the XML document 202. Schema components of an XMLSchema are well known in the art and are defined at www.w3.org (XMLSchema standard). Preferably, the XML schema document is registered andstored in the DBMS 105 in a format such that its components can beidentified.

The path table 600 can also include a statistics column 610 thatcontains information related to the node's path that can be used by theDBMS 105 to optimize execution plans. For example, the statisticalinformation can include distribution statistics that indicate withinwhat range values fall, the number of entries that include the value,and other information that help the DBMS 105 to create efficientexecution plans. The statistical information in the statistics column610 is preferably stored as a binary large object (BLOB).

The path table 600 can also include other columns, e.g., a miscellaneouscolumn 612, that contain other types of information related to thenode's path. Accordingly, detailed information pertaining to a node 500can be captured and easily associated with the node 500 via the pathidentifier 602 without increasing the size of the node 500. Whilenavigating a structured document 202, e.g., during query execution, theDBMS 105 can access information related to any node 500 it traverses bylooking up the path identifier 602 of the node 500 in the path table600.

Embodiments of the present invention associate information related todata in a structured document stored in its native format in a computerimplemented database. Through the aspects of the present invention, astructured document is parsed and a hierarchical node tree comprising aplurality of nodes is generated from the parsed data. Each noderepresents a component in the structured document and includes a pathidentifier. The path identifier is used to directly locate in a pathtable information related to the node during the querying process.

The present invention has been described with reference to certainpreferred versions thereof; however, other versions are possible. Forexample, the storage mechanism 200 can be used for any structureddocument and is not restricted to XML documents, as would be apparent toone of ordinary skill. Other configurations of the path table can alsobe used. For example, the path table is not limited to the types ofinformation illustrated in FIG. 6 and other types of information relatedto the node can be stored in the path table. Further, alternative stepsequivalent to those described for the storage method can also be used inaccordance with the parameters of the described implementation, as wouldbe apparent to one of ordinary skill. Therefore, the spirit and scope ofthe appended claims should not be limited to the description of thepreferred versions contained herein.

1. A method for associating information related to a component in astructured document that is stored in its native format in a databasesystem comprising: a) generating a hierarchical node tree comprising aplurality of nodes, wherein each node represents a component in thestructured document; b) for each node, generating a path associated withthe node, wherein the path follows the hierarchical structure of thestructured document from a root node to the node; c) storing in eachnode an identifier associated with the path for the node; and d)providing a table that correlates the identifier with informationrelated to the associated path, wherein the information applies to thecomponent represented by the node.
 2. A method according to claim 1wherein generating the hierarchical node tree comprises: a1) parsing thestructured document into the plurality of nodes; and a2) linking each ofthe nodes via pointers to form the hierarchical node tree.
 3. A methodaccording to claim 1 further comprising: e) storing each node in atleast one record in the database, wherein each record comprises aplurality of node slots and each node slot includes a pointer pointingto a node.
 4. A method according to claim 1 wherein a node has at leastone child node and comprises at least one child pointer that points toeach of the child nodes.
 5. A method according to claim 1 furthercomprising: e) storing the identifier in the table.
 6. A methodaccording to claim 5 further comprising: f) storing the path in thetable; and g) correlating the identifier with the path associated withthe node.
 7. A method according to claim 1 further comprising: e) foreach path, providing access control information based on an accesscontrol policy for the structured document; f) storing the accesscontrol information in the table; and g) correlating the identifier forthe node with the access control information associated with the path.8. A method according to claim 7 wherein the access control informationassociated with the path is an executable statement indicating who isgranted or denied access to the node.
 9. A method according to claim 1further comprising: e) for each path, determining schema informationassociated with the path; f) storing a reference to the schemainformation in the table; and g) correlating the identifier associatedwith the path with the reference to the schema information associatedwith the path.
 10. A method according to claim 1 further comprising: e)for each path, collecting statistical information associated with thepath; f) storing the statistical information in the table; and g)correlating the identifier for the path with the statistical informationassociated with the path.
 11. A computer readable medium containingprogramming instructions for associating information related to acomponent in a structured document that is stored in its native formatin a database system comprising instructions for: a) generating ahierarchical node tree comprising a plurality of nodes, wherein eachnode represents a component in the structured document; b) for eachnode, generating a path associated with the node, wherein the pathfollows the hierarchical structure of the structured document from aroot node to the node; c) storing in each node an identifier associatedwith the path for the node; and d) providing a table that correlates theidentifier with the information related to the associated path, whereinthe information applies to the component represented by the node.
 12. Acomputer readable medium according to claim 11 wherein generating thehierarchical node tree comprises: a1) parsing the structured documentinto the plurality of nodes; and a2) linking each of the nodes viapointers to form the hierarchical node tree.
 13. A computer readablemedium according to claim 11 further comprising: e) storing each node inat least one record in the database, wherein each record comprises aplurality of node slots and each node slot includes a pointer pointingto a node of the plurality of nodes.
 14. A computer readable mediumaccording to claim 11 wherein a node has at least one child node andcomprises at least one child pointer that points to each of the childnodes.
 15. A computer readable medium according to claim 11 furthercomprising: e) storing the identifier in the table.
 16. A computerreadable medium according to claim 15 further comprising: f) storing thepath in the table; and g) correlating the identifier with the pathassociated with the node.
 17. A computer readable medium according toclaim 11 further comprising: e) for each path, providing access controlinformation based on an access control policy for the structureddocument; f) storing the access control information in the table; and g)correlating the identifier for the node with the access controlinformation associated with the path.
 18. A computer readable mediumaccording to claim 17 wherein the access control information associatedwith the path is an executable statement indicating who is granted ordenied access to the node.
 19. A computer readable medium according toclaim 11 further comprising: e) for each path, determining schemainformation associated with the path; f) storing a reference to theschema information in the table; and g) correlating the identifier forthe path with the reference to the schema information associated withthe path.
 20. A computer readable medium according to claim 11 furthercomprising: e) for each path, collecting statistical informationassociated with the path; f) storing the statistical information in thetable; and g) correlating the identifier for the path with thestatistical information associated with the path.
 21. A system forassociating information related to a component in a structured documentthat is stored in its native format in a database system comprising: a)a computer system coupled to at least one data storage device; b) adatabase management system in the computer system; c) a storagemechanism in the database management system for generating ahierarchical node tree comprising a plurality of nodes, wherein eachnode represents a component in the structured document, wherein thestorage mechanism comprises a path generator for generating a pathassociated with each node, wherein the path follows the hierarchicalstructure of the structured document from a root node to the node andwherein the storage mechanism stores in each node an identifierassociated with the path; and e) a table in the data storage device,wherein the table correlates the identifier with information related tothe associated path, wherein the information applies to the componentrepresented by the node.
 22. A system according to claim 21 wherein thestorage mechanism further comprises a parser for parsing the structureddocument into a plurality of nodes and a node tree generator for linkingeach of the nodes via pointers to form the hierarchical node tree.
 23. Asystem according to claim 21, wherein the storage mechanism stores eachnode in at least one record in the database, wherein each recordcomprises a plurality of node slots and each node slot includes apointer pointing to a node.
 24. A system according to claim 21 whereinthe storage mechanism stores the identifier in the table.
 25. A systemaccording to claim 24 wherein the storage mechanism stores the path inthe table and correlates the identifier with the path associated withthe node.
 26. A system according to claim 21 further comprising accesscontrol information for each path, wherein the access controlinformation is based on an access control policy for the structureddocument, and wherein the storage mechanism stores the access controlinformation in the table and correlates the identifier for the path withthe access control information associated with the path.
 27. A systemaccording to claim 26 wherein the access control information associatedwith the path is an executable statement indicating who is granted ordenied access to the node.
 28. A system according to claim 21 furtherdetermining for each path schema information associated with the path,and wherein the storage mechanism stores a reference to the schemainformation in the table and correlates the identifier for the path withthe reference to the schema information.
 29. A system according to claim21 further comprising means for collecting for each path statisticalinformation associated with the path and wherein the storage mechanismstores the statistical information in the table and correlates theidentifier for the path with the statistical information.